1 Background

The goal of this tutorial is to give you a quick introduction to the theory and practice of amplicon sequencing and how to go from raw sequencing data to publication-quality figures and analysis. We will be focusing on 16S rRNA amplicon data, but as you will see in later sessions, many of these concepts and techniques are relevant to other data types and fields!

For this session, we are going to emulate a fairly common scenario. The data we will analyze today are derived from gnotobiotic mice which were colonized using human samples collected before and after a weight-loss diet intervention. As is often the case in gnotobiotic fecal transplant experiments, we examined host phenotypes. Mice that received post-diet microbiomes demonstrated rapid and sustained weight loss (Panel A). Despite consuming similar amounts of food (Panel B), the fecal caloric content was higher in post-diet mice, suggesting a defect in nutrient absorption (Panel C). We used 16S rRNA amplicon sequencing of the V4 region to characterize the recipient microbiome and map which organisms could be responsible for the weight-loss phenotype. We will be examining a subset of 3 mice from each of the pre-diet and post-diet groups at 4 time points across the experiment to generate a hypothesis about which microbes in the post-diet microbiota are responsible for weight loss.

Figure 1. (A) Weight loss in recipient mice. (B) Food consumption in recipient mice. (C) Fecal caloric content.

2 Preparing For Class

Before our session, please complete the sections below. The goal is for you to have downloaded all of the data and software ahead of time so we can focus on concepts and data interpretation rather than troubleshooting. That being said, installing software is often the most difficult part of bioinformatics, so please email jordan.bisanz@psu.edu if you run into difficulties. You will need access to a Mac/Windows/Linux machine with at least ~150 MB of free storage. If you can’t access a computer, please let me know ahead of time and I can make an arrangement for you to remotely access an RStudio session via your web browser.

2.1 Installing R and R Studio

We will be making use of the R programming language, which is commonly used across many fields for statistical analysis. To begin, please install and/or update to the current version of R (4.3.2) by downloading it from the following link. R is commonly accessed through an easy-to-use interface called RStudio, which can be downloaded here. R and many of the packages you will use for data analysis are updated on a regular basis, so please ensure you have the latest version of R.

If you are not familiar with R, it is an extremely useful tool for analyzing and visualizing data; please watch these videos before our class:

To ensure you have successfully installed the software, please open R studio and type the following code into your console:

R.Version()$version.string

The expected output should be: R version 4.3.2 (2023-10-31).
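You can also check this programmatically; a small base-R sketch (the minimum version string is an assumption taken from above):

```r
required <- "4.3.2"  # assumed minimum version for this tutorial
if (getRversion() < required) {
  warning("Please update R: you have ", getRversion(), " but need >= ", required)
} else {
  message("R version OK: ", getRversion())
}
```

getRversion() returns a version object, so the comparison against the string is done numerically rather than alphabetically.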

2.2 Installing Required Packages

Generally speaking, most packages for microbial ecology exist in 3 places: CRAN, Bioconductor, and GitHub. CRAN is the main R repository, from which packages can be installed using install.packages(). Bioconductor is a repository specializing in bioinformatics; its packages are installed via its own package (available from CRAN) called BiocManager, which has a function called BiocManager::install(). Finally, the newest versions of packages, or in-development packages, are often found on GitHub and can be installed using a package called devtools, which offers devtools::install_github(). To install the required packages, copy and paste the code below line by line into your R session. If asked to compile, type yes, and if asked to update packages, type no.

install.packages("tidyverse") # A group of related packages to facilitate a wide variety of useful tasks including plotting
install.packages("vegan") # A commonly used ecology package offering diversity calculations and statistical tests
install.packages("ape") # A package for phylogenetic analysis offering several useful functions for microbial ecology
install.packages("lme4") # A package for advanced statistical testing
install.packages("lmerTest") # A package for advanced statistical testing

install.packages("BiocManager") # The package manager for Bioconductor
BiocManager::install("dada2") # A one-stop-shop for processing amplicon data
BiocManager::install("phyloseq") # A R package for microbiome analysis

install.packages("devtools") # A multi-purpose tool for developing packages
devtools::install_github("jbisanz/qiime2R") # A multi-purpose microbiome import/processing package written by yours truly

After you have installed all of the above packages, it is a good idea to try loading them one by one as below. If a package fails to load, read the error message and then try to reinstall. Please contact me if there are issues which you can’t resolve.

library(tidyverse)
library(dada2)
library(phyloseq)
library(ape)
library(vegan)
library(qiime2R)
library(lme4)

2.3 Downloading Data

We will start by creating a new directory for our project using the dir.create() function. I am going to place this on my desktop, but you might want to pick somewhere else depending on your computer:

dir.create("~/Desktop/BMMB521_tutorial") #Note: windows users would use dir.create("C:/Users/YourUserName/Desktop/BMMB521_tutorial")

Then we can specifically set our working directory to be inside this folder as below:

setwd("~/Desktop/BMMB521_tutorial") #Note: windows users would use setwd("C:/Users/YourUserName/Desktop/BMMB521_tutorial")

2.3.1 Reads

I am providing the sequencing data as a single compressed directory (a gzipped tarball). You could download it via R as below, or using your web browser by copying the link.

download.file("https://github.com/jbisanz/BMS270_SP2021/raw/main/BMS270_SP2021_reads.tar.gz", "BMMB521_reads.tar.gz")

Now that we have downloaded the reads, we can unpack the folder as below:

untar("BMMB521_reads.tar.gz")

You will now see that a folder called reads has appeared. We can look at them in R using list.files().

list.files("reads")
##  [1] "MOUSE.G11.Day1_1.fastq.gz"  "MOUSE.G11.Day1_2.fastq.gz" 
##  [3] "MOUSE.G11.Day14_1.fastq.gz" "MOUSE.G11.Day14_2.fastq.gz"
##  [5] "MOUSE.G11.Day21_1.fastq.gz" "MOUSE.G11.Day21_2.fastq.gz"
##  [7] "MOUSE.G11.Day7_1.fastq.gz"  "MOUSE.G11.Day7_2.fastq.gz" 
##  [9] "MOUSE.G12.Day1_1.fastq.gz"  "MOUSE.G12.Day1_2.fastq.gz" 
## [11] "MOUSE.G12.Day14_1.fastq.gz" "MOUSE.G12.Day14_2.fastq.gz"
## [13] "MOUSE.G12.Day21_1.fastq.gz" "MOUSE.G12.Day21_2.fastq.gz"
## [15] "MOUSE.G12.Day7_1.fastq.gz"  "MOUSE.G12.Day7_2.fastq.gz" 
## [17] "MOUSE.G21.Day1_1.fastq.gz"  "MOUSE.G21.Day1_2.fastq.gz" 
## [19] "MOUSE.G21.Day14_1.fastq.gz" "MOUSE.G21.Day14_2.fastq.gz"
## [21] "MOUSE.G21.Day21_1.fastq.gz" "MOUSE.G21.Day21_2.fastq.gz"
## [23] "MOUSE.G21.Day7_1.fastq.gz"  "MOUSE.G21.Day7_2.fastq.gz" 
## [25] "MOUSE.H11.Day1_1.fastq.gz"  "MOUSE.H11.Day1_2.fastq.gz" 
## [27] "MOUSE.H11.Day14_1.fastq.gz" "MOUSE.H11.Day14_2.fastq.gz"
## [29] "MOUSE.H11.Day21_1.fastq.gz" "MOUSE.H11.Day21_2.fastq.gz"
## [31] "MOUSE.H11.Day7_1.fastq.gz"  "MOUSE.H11.Day7_2.fastq.gz" 
## [33] "MOUSE.H12.Day1_1.fastq.gz"  "MOUSE.H12.Day1_2.fastq.gz" 
## [35] "MOUSE.H12.Day14_1.fastq.gz" "MOUSE.H12.Day14_2.fastq.gz"
## [37] "MOUSE.H12.Day21_1.fastq.gz" "MOUSE.H12.Day21_2.fastq.gz"
## [39] "MOUSE.H12.Day7_1.fastq.gz"  "MOUSE.H12.Day7_2.fastq.gz" 
## [41] "MOUSE.H21.Day1_1.fastq.gz"  "MOUSE.H21.Day1_2.fastq.gz" 
## [43] "MOUSE.H21.Day14_1.fastq.gz" "MOUSE.H21.Day14_2.fastq.gz"
## [45] "MOUSE.H21.Day21_1.fastq.gz" "MOUSE.H21.Day21_2.fastq.gz"
## [47] "MOUSE.H21.Day7_1.fastq.gz"  "MOUSE.H21.Day7_2.fastq.gz"

2.3.2 Metadata

Now that we have the reads, we need to know which samples they belong to! This is called metadata, and a spreadsheet (in tab-separated format) is available to download as below. Note that you could just as easily use Excel spreadsheets; however, there are some special considerations for reading/writing Excel files, and Excel loves to auto-convert text to dates, which can cause issues.

download.file("https://github.com/jbisanz/BMS270_SP2021/raw/main/metadata.tsv","metadata.tsv")

We can now take a quick look at the metadata after reading it in:

read_tsv("metadata.tsv")
## # A tibble: 24 × 6
##    SampleID Group    MouseID Time_days Read1                            Read2   
##    <chr>    <chr>    <chr>       <dbl> <chr>                            <chr>   
##  1 G11_d1   pre-diet G11             1 reads/MOUSE.G11.Day1_1.fastq.gz  reads/M…
##  2 G11_d14  pre-diet G11            14 reads/MOUSE.G11.Day14_1.fastq.gz reads/M…
##  3 G11_d21  pre-diet G11            21 reads/MOUSE.G11.Day21_1.fastq.gz reads/M…
##  4 G11_d7   pre-diet G11             7 reads/MOUSE.G11.Day7_1.fastq.gz  reads/M…
##  5 G12_d1   pre-diet G12             1 reads/MOUSE.G12.Day1_1.fastq.gz  reads/M…
##  6 G12_d14  pre-diet G12            14 reads/MOUSE.G12.Day14_1.fastq.gz reads/M…
##  7 G12_d21  pre-diet G12            21 reads/MOUSE.G12.Day21_1.fastq.gz reads/M…
##  8 G12_d7   pre-diet G12             7 reads/MOUSE.G12.Day7_1.fastq.gz  reads/M…
##  9 G21_d1   pre-diet G21             1 reads/MOUSE.G21.Day1_1.fastq.gz  reads/M…
## 10 G21_d14  pre-diet G21            14 reads/MOUSE.G21.Day14_1.fastq.gz reads/M…
## # ℹ 14 more rows

2.3.3 Taxonomy Database

Finally, we will download a database for taxonomic assignment (if we believe such a thing is valuable). We will be using the RDP database; however, we will discuss alternate (and perhaps better) choices during the tutorial.

download.file("https://zenodo.org/record/4310151/files/rdp_species_assignment_18.fa.gz", "rdp_species_assignment_18.fa.gz")
download.file("https://zenodo.org/record/4310151/files/rdp_train_set_18.fa.gz", "rdp_train_set_18.fa.gz")

After all of this, you should find a folder on your desktop called BMMB521_tutorial. Within this folder, you should find a subfolder called reads containing 48 fastq files, your metadata.tsv file, and the two formatted RDP fa.gz files. With these downloaded and the packages installed, you are ready to begin!

Stop here before class!


3 Setting up your environment

Much like you might set up your bench before starting an experiment, or do your mise en place before cooking a meal, it pays to take some time to set up your digital workspace before starting an analysis. You should always keep a record of all the tools and commands you used to perform your analysis; this is essential for reproducibility! At a minimum, you can create a new R Markdown document (File > New File > R Markdown). R Markdown documents combine raw code with free-text sections and function like an electronic notebook for your analysis. This is a great way to document your analysis and can be shared later with collaborators/reviewers/etc… guess what this document is? Visit here to see the markdown document for this class!

As a first pass we can load our libraries that we will need as below:

library(tidyverse)
library(dada2)
library(phyloseq)
library(ape)
library(vegan)
library(qiime2R)
library(lme4)

Next we can make sure we have set our working directory. This ensures we will be able to properly find our files. This was set up before class and is likely on your desktop at ~/Desktop/BMMB521_tutorial. We can explicitly set our working directory with the setwd() function.

setwd("~/Desktop/BMMB521_tutorial")

Next we might want to set up some directories for our outputs as below:

dir.create("figures")
dir.create("backups")

Finally, it is always helpful to keep a record of your work environment. This can easily be accomplished with sessionInfo() as below. Having a list of all of the packages you used, and their versions, can be very helpful later when writing a manuscript!

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lme4_1.1-35.1   Matrix_1.6-5    qiime2R_0.99.6  vegan_2.6-4    
##  [5] lattice_0.21-9  permute_0.9-7   ape_5.7-1       phyloseq_1.46.0
##  [9] dada2_1.30.0    Rcpp_1.0.12     lubridate_1.9.3 forcats_1.0.0  
## [13] stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2     readr_2.1.5    
## [17] tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.0   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.15.0          
##   [3] jsonlite_1.8.8              magrittr_2.0.3             
##   [5] NADA_1.6-1.1                nloptr_2.0.3               
##   [7] rmarkdown_2.25              zlibbioc_1.48.0            
##   [9] vctrs_0.6.5                 multtest_2.58.0            
##  [11] minqa_1.2.6                 Rsamtools_2.18.0           
##  [13] RCurl_1.98-1.14             base64enc_0.1-3            
##  [15] htmltools_0.5.7             S4Arrays_1.2.0             
##  [17] truncnorm_1.0-9             Rhdf5lib_1.24.2            
##  [19] SparseArray_1.2.4           Formula_1.2-5              
##  [21] rhdf5_2.46.1                sass_0.4.8                 
##  [23] bslib_0.6.1                 htmlwidgets_1.6.4          
##  [25] plyr_1.8.9                  cachem_1.0.8               
##  [27] GenomicAlignments_1.38.2    igraph_2.0.2               
##  [29] lifecycle_1.0.4             iterators_1.0.14           
##  [31] pkgconfig_2.0.3             R6_2.5.1                   
##  [33] fastmap_1.1.1               GenomeInfoDbData_1.2.11    
##  [35] MatrixGenerics_1.14.0       digest_0.6.34              
##  [37] colorspace_2.1-0            ShortRead_1.60.0           
##  [39] S4Vectors_0.40.2            Hmisc_5.1-1                
##  [41] GenomicRanges_1.54.1        hwriter_1.3.2.1            
##  [43] fansi_1.0.6                 timechange_0.3.0           
##  [45] abind_1.4-5                 mgcv_1.9-0                 
##  [47] compiler_4.3.2              bit64_4.0.5                
##  [49] withr_3.0.0                 htmlTable_2.4.2            
##  [51] backports_1.4.1             BiocParallel_1.36.0        
##  [53] MASS_7.3-60                 DelayedArray_0.28.0        
##  [55] biomformat_1.30.0           tools_4.3.2                
##  [57] foreign_0.8-85              nnet_7.3-19                
##  [59] glue_1.7.0                  nlme_3.1-163               
##  [61] rhdf5filters_1.14.1         grid_4.3.2                 
##  [63] checkmate_2.3.1             cluster_2.1.4              
##  [65] reshape2_1.4.4              ade4_1.7-22                
##  [67] generics_0.1.3              gtable_0.3.4               
##  [69] tzdb_0.4.0                  data.table_1.15.0          
##  [71] hms_1.1.3                   utf8_1.2.4                 
##  [73] XVector_0.42.0              BiocGenerics_0.48.1        
##  [75] foreach_1.5.2               pillar_1.9.0               
##  [77] vroom_1.6.5                 splines_4.3.2              
##  [79] bit_4.0.5                   survival_3.5-7             
##  [81] deldir_2.0-2                tidyselect_1.2.0           
##  [83] Biostrings_2.70.2           knitr_1.45                 
##  [85] gridExtra_2.3               IRanges_2.36.0             
##  [87] SummarizedExperiment_1.32.0 zCompositions_1.5.0-1      
##  [89] stats4_4.3.2                xfun_0.42                  
##  [91] Biobase_2.62.0              matrixStats_1.2.0          
##  [93] DT_0.32                     stringi_1.8.3              
##  [95] yaml_2.3.8                  boot_1.3-28.1              
##  [97] evaluate_0.23               codetools_0.2-19           
##  [99] interp_1.1-6                cli_3.6.2                  
## [101] RcppParallel_5.1.7          rpart_4.1.21               
## [103] munsell_0.5.0               jquerylib_0.1.4            
## [105] GenomeInfoDb_1.38.6         png_0.1-8                  
## [107] parallel_4.3.2              latticeExtra_0.6-30        
## [109] jpeg_0.1-10                 bitops_1.0-7               
## [111] scales_1.3.0                crayon_1.5.2               
## [113] rlang_1.1.3

Before going forward, we will read in our metadata table with the help of the read_tsv() function. At any point, you can click on your data in the top right corner of RStudio and it will open in a tab for you to look at!

metadata<-read_tsv("metadata.tsv")

We can then explore it as below:

metadata %>% count(Group, Time_days)
## # A tibble: 8 × 3
##   Group     Time_days     n
##   <chr>         <dbl> <int>
## 1 post-diet         1     3
## 2 post-diet         7     3
## 3 post-diet        14     3
## 4 post-diet        21     3
## 5 pre-diet          1     3
## 6 pre-diet          7     3
## 7 pre-diet         14     3
## 8 pre-diet         21     3

You can see that we have 2 groups (pre- and post-diet), 4 time points (1, 7, 14, and 21 days), and 3 mice per group at each time point. The last two columns store the locations of the sequencing data.

While we are at it, we are going to define an explicit order to the groups which will be reflected in all future graphs/analyses. We will place pre-diet before post-diet:

metadata <- metadata %>% mutate(Group=factor(Group, c("pre-diet","post-diet")))

4 Processing Raw Data

The most standard format for sequencing data is called FASTQ. It is composed of a repeating pattern of 4 lines, as below:

@M01869:152:000000000-AMWM8:1:1101:21609:1687 1:N:0:0
TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGCATGGCCG...
+
BBCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGC,-...

The first line contains a unique identifier for each read, starting with an @ sign. This read, which comes from an Illumina MiSeq, shows the machine’s serial number (M01869), the flow cell ID (AMWM8; unique to the sequencing run), and the read’s physical location on the flow cell (1:1101:21609:1687). The next line stores the actual DNA sequence for the read, which is followed by a + sign on the third line. The last line of the pattern stores quality information, wherein the probability of an incorrect base call is encoded as a character (which in turn represents a number, as seen here).
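The quality encoding is easy to decode yourself: each character's ASCII code, minus an offset of 33 (the Phred+33 scheme used by modern Illumina instruments), gives the Phred score Q, and the probability of an erroneous base call is 10^(-Q/10). A base-R sketch using the start of the example quality line above:

```r
qual_string <- "BBCCCGG"              # first few characters of the example quality line
q <- utf8ToInt(qual_string) - 33      # Phred+33: ASCII code minus 33 gives the Phred score
p_error <- 10^(-q / 10)               # probability each base call is wrong
# 'B' (ASCII 66) -> Q = 33 -> p_error ~ 5e-04
data.frame(char = strsplit(qual_string, "")[[1]], Q = q, p_error = signif(p_error, 3))
```

So a Q30 base has a 1-in-1000 chance of being wrong, and a Q40 base a 1-in-10000 chance.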

4.1 Data Quality

Given that it is quite hard to read the quality profile by eye, we can instead plot it using the plotQualityProfile() function. Here we are going to grab the first 3 samples ([1:3]) from each of the read columns of our metadata and add a title to the plot with ggtitle().

plotQualityProfile(metadata$Read1[1:3]) + ggtitle("Forward Read Quality (R1)")

plotQualityProfile(metadata$Read2[1:3]) + ggtitle("Reverse Read Quality (R2)")

As you can see, these reads are VERY high quality over their entire length. See here for an example of bad reads that would require some more heavy-handed QC.

4.2 Quality Filtering

Nonetheless, we should still do some QC which entails trimming back the ends of the reads and removing those which are suspected to be highly error prone.

dir.create("filtered_reads")

metadata<-
  metadata %>%
  mutate(Read1_filtered=paste0("filtered_", Read1)) %>%
  mutate(Read2_filtered=paste0("filtered_", Read2))
  

trimlog <- filterAndTrim(
                      fwd=metadata$Read1,
                      rev=metadata$Read2,
                      filt=metadata$Read1_filtered,
                      filt.rev=metadata$Read2_filtered,
                      compress=TRUE, # use gzip compression to save space
                      truncLen=c(240,140), # trim back the reads on the 3' end
                      trimLeft=5, # remove 5 bases from the beginning as these q scores may not be reliable
                      maxEE=2, # no more than 2 expected errors per read
                      truncQ=2, # trim the read after a q score of 2
                      multithread=TRUE, # use multiple processors, you may need to set this to FALSE on PCs
                      rm.phix=TRUE, # Remove reads mapping to phiX
                      verbose=TRUE
                   )
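The maxEE filter works on expected errors: summing the per-base error probabilities across a read gives the number of errors you would expect it to contain, and reads above the threshold are discarded. A base-R sketch with made-up quality strings (the helper function name is my own):

```r
expected_errors <- function(qual_string) {
  q <- utf8ToInt(qual_string) - 33   # Phred+33 scores
  sum(10^(-q / 10))                  # expected number of erroneous bases in the read
}

good_read <- strrep("G", 100)  # 100 bases at Q38: ~0.016 expected errors
bad_read  <- strrep("+", 100)  # 100 bases at Q10: ~10 expected errors
expected_errors(good_read) <= 2  # passes maxEE=2
expected_errors(bad_read)  <= 2  # fails maxEE=2
```

Note that this metric penalizes uniformly mediocre reads as well as reads with a few terrible bases, which is why it is preferred over a simple average quality cutoff.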

QUESTION: What is phiX and why might we remove it?

Now we can check how many reads have been removed:

trimlog %>%
  as.data.frame() %>%
  mutate(Percent_Passed=reads.out/reads.in*100) %>%
  interactive_table()

4.3 Denoising Data

Amplicon sequencing tends to be very noisy. To deal with this, operational taxonomic units (OTUs) have been heavily used for analysis. In this type of analysis, sequences are clustered based on a % identity threshold, often 97%. This has the advantage of negating sequencing errors but comes at the cost of resolution. The 97% threshold is generally based on the notion that the 16S rRNA genes of two members of the same species are >=97% identical; however, this is based on the whole 16S rRNA gene and not the small fragment we are commonly analyzing, so this logic is not entirely sound.

As a more modern approach, denoising is commonly used to identify sequencing errors and correct them. A simple explanation of how it works is that errors are relatively rare, and as such a sequence containing an error can be recognized as a rare substitution variant of a more abundant sequence. For a more complete explanation and alternate approaches, see the manuscript describing Dada2, or any of the other popular denoising algorithms available:

The first step is to learn the error profile for the sequencing run. This is the longest part of the analysis and is based on a sampling of the data.

forward_error<-learnErrors(metadata$Read1_filtered, multithread = TRUE)
## 70572145 total bases in 300307 reads from 24 samples will be used for learning the error rates.
reverse_error<-learnErrors(metadata$Read2_filtered, multithread = TRUE)
## 40541445 total bases in 300307 reads from 24 samples will be used for learning the error rates.

Next we can actually correct the data:

forward_denoised<- dada(metadata$Read1_filtered, err=forward_error, multithread=TRUE)
## Sample 1 - 9202 reads in 2938 unique sequences.
## Sample 2 - 10794 reads in 3650 unique sequences.
## Sample 3 - 11819 reads in 3761 unique sequences.
## Sample 4 - 13017 reads in 3918 unique sequences.
## Sample 5 - 11273 reads in 3387 unique sequences.
## Sample 6 - 14387 reads in 4399 unique sequences.
## Sample 7 - 9528 reads in 3090 unique sequences.
## Sample 8 - 13337 reads in 3602 unique sequences.
## Sample 9 - 9263 reads in 3003 unique sequences.
## Sample 10 - 13389 reads in 4371 unique sequences.
## Sample 11 - 14108 reads in 4550 unique sequences.
## Sample 12 - 13718 reads in 4280 unique sequences.
## Sample 13 - 12655 reads in 3936 unique sequences.
## Sample 14 - 15039 reads in 4403 unique sequences.
## Sample 15 - 14517 reads in 3901 unique sequences.
## Sample 16 - 13289 reads in 4447 unique sequences.
## Sample 17 - 12728 reads in 4030 unique sequences.
## Sample 18 - 11805 reads in 3928 unique sequences.
## Sample 19 - 14642 reads in 4471 unique sequences.
## Sample 20 - 10214 reads in 3093 unique sequences.
## Sample 21 - 13699 reads in 3799 unique sequences.
## Sample 22 - 11244 reads in 3818 unique sequences.
## Sample 23 - 14812 reads in 4579 unique sequences.
## Sample 24 - 11828 reads in 3853 unique sequences.
reverse_denoised<- dada(metadata$Read2_filtered, err=reverse_error, multithread=TRUE)
## Sample 1 - 9202 reads in 2415 unique sequences.
## Sample 2 - 10794 reads in 3130 unique sequences.
## Sample 3 - 11819 reads in 3338 unique sequences.
## Sample 4 - 13017 reads in 3433 unique sequences.
## Sample 5 - 11273 reads in 2684 unique sequences.
## Sample 6 - 14387 reads in 3641 unique sequences.
## Sample 7 - 9528 reads in 2738 unique sequences.
## Sample 8 - 13337 reads in 3779 unique sequences.
## Sample 9 - 9263 reads in 2467 unique sequences.
## Sample 10 - 13389 reads in 3601 unique sequences.
## Sample 11 - 14108 reads in 4081 unique sequences.
## Sample 12 - 13718 reads in 3797 unique sequences.
## Sample 13 - 12655 reads in 3445 unique sequences.
## Sample 14 - 15039 reads in 3834 unique sequences.
## Sample 15 - 14517 reads in 3303 unique sequences.
## Sample 16 - 13289 reads in 3936 unique sequences.
## Sample 17 - 12728 reads in 3381 unique sequences.
## Sample 18 - 11805 reads in 3193 unique sequences.
## Sample 19 - 14642 reads in 3505 unique sequences.
## Sample 20 - 10214 reads in 2618 unique sequences.
## Sample 21 - 13699 reads in 3157 unique sequences.
## Sample 22 - 11244 reads in 3212 unique sequences.
## Sample 23 - 14812 reads in 4046 unique sequences.
## Sample 24 - 11828 reads in 3228 unique sequences.

4.4 Overlap Reads

Up until this point, we have been treating our forward and reverse reads separately. We can now merge them together to reconstruct the complete V4 sequence using mergePairs().

merged_reads <- mergePairs(
                          dadaF=forward_denoised, 
                          derepF=metadata$Read1_filtered, 
                          dadaR=reverse_denoised, 
                          derepR=metadata$Read2_filtered, 
                          verbose=TRUE
                          )

4.5 Build the Feature Table

Now we can get the output we really want: a table with the number of times each sequence is observed in each sample.

merged_table <- makeSequenceTable(merged_reads)

4.6 Remove Chimeras

When dealing with amplicon data, it is incredibly important to remove chimeric reads. These sequences are generated through a number of mechanisms but result in a read that is derived from multiple original pieces of template DNA. Certain protocols for generating 16S rRNA amplicons, including the methods used to generate this data, are highly prone to chimera formation. For more information on how to prevent their formation, see the manuscript by Gohl et al. 2016. For a discussion of chimeras, see here.

asv_table<-removeBimeraDenovo(merged_table, method="pooled", multithread=TRUE, verbose=TRUE)

Question: How many ASVs did we remove? Do they represent a lot of our data?

4.7 Clean up

Frustratingly, some people like to have samples as rows and features (ASVs) as columns, while others like the reverse. Different tools have different expectations, and you must always be aware which orientation is expected. I, unlike the authors of Dada2, prefer samples as columns. The good news is that we can easily transpose the table back and forth using the t() function.

asv_table<-t(asv_table)
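If the orientation convention is ever unclear, a quick toy example shows exactly what t() does (the sample/ASV names here are hypothetical):

```r
# A toy feature table with samples as rows (dada2's orientation)
toy <- matrix(c(10, 0, 5, 3), nrow = 2,
              dimnames = list(c("SampleA", "SampleB"), c("ASV_1", "ASV_2")))
toy     # 2 samples (rows) x 2 ASVs (columns)
t(toy)  # transposed: ASVs are now the rows, samples the columns
```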

If you look at the sample names in your table, each is currently the name of the fastq file rather than the sample ID; we can switch them out as below.

asv_table<-asv_table[,basename(metadata$Read1_filtered)] #put columns in the same order as the metadata
colnames(asv_table)<-metadata$SampleID #overwrite the column names

You can also see in our table that the ASVs are being stored as their literal sequence. We can swap this out with something a little bit more readable as below while storing a copy of the sequences:

sequences<-tibble(ASV=paste0("ASV_",1:nrow(asv_table)), Sequence=rownames(asv_table))
rownames(asv_table)<-sequences$ASV

We can save an image of our session in the event of a problem, so that we can go back and reload our computed data rather than recompute it (the file name is arbitrary):

save.image("backups/session.RData")

Finally, to free up some memory, we can remove all the intermediate data.

rm(forward_denoised, forward_error, merged_reads, merged_table, reverse_denoised, reverse_error, trimlog)

5 Alpha Diversity

One of the most obvious ways to describe a community is in terms of its diversity. While we probably all intuitively understand the concept, how might we actually define it quantitatively?

Consider the following two samples, which would you say is more diverse?

Organism       SampleA  SampleB
E. coli             99       50
L. rhamnosus         1       50
S. aureus            1        0

Perhaps the simplest metric you could think of, and one which is commonly applied, is the number of species (or in our case, ASVs) present in an ecosystem. If we apply this metric to our fake data above, we would conclude that SampleA is more diverse (it has 3 organisms present vs 2). Other metrics are a little more complex, such as Shannon diversity, which uses both the number of species present and their evenness. By Shannon diversity, SampleB in the fake data above is the more diverse (0.69 vs 0.11). Today we will use the number of observed ASVs and the Shannon diversity on our actual data. Note: in the vast majority of cases, these values will be highly correlated.
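You can reproduce the Shannon values from the toy example above in a few lines of base R (vegan's diversity() performs the same calculation with the natural log by default):

```r
shannon <- function(counts) {
  p <- counts / sum(counts)   # relative abundances
  p <- p[p > 0]               # by convention, 0 * log(0) contributes nothing
  -sum(p * log(p))            # Shannon diversity (natural log)
}

shannon(c(99, 1, 1))   # SampleA: ~0.11 despite having 3 species
shannon(c(50, 50, 0))  # SampleB: ~0.69 (= log(2)); fewer species, but perfectly even
```

The rare species in SampleA contribute almost nothing to the sum, which is why its Shannon diversity is so low despite its higher species count.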

Before going forward, we also need to consider the effect of sampling depth and account for it by subsampling/rarefying the data. This is a way to normalize for sequencing depth in silico, which is appropriate here but may not always be advisable.

Question: how even is our sequencing depth across samples?
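As a conceptual illustration, here is a base-R sketch of rarefying a single sample's counts to a fixed depth without replacement (subsample_table() from qiime2R applies this idea across the whole table; the counts and helper name here are made up):

```r
set.seed(42)  # subsampling is random, so fix the seed for reproducibility
counts <- c(ASV_1 = 500, ASV_2 = 300, ASV_3 = 150, ASV_4 = 50)  # hypothetical sample, 1000 reads

rarefy_sample <- function(counts, depth) {
  reads <- rep(names(counts), times = counts)    # expand to one entry per sequenced read
  drawn <- sample(reads, size = depth)           # draw 'depth' reads without replacement
  table(factor(drawn, levels = names(counts)))   # re-tabulate, keeping zero counts
}

rarefy_sample(counts, depth = 100)        # every sample ends up at the same depth
sum(rarefy_sample(counts, depth = 100))   # always exactly 100
```

Rare ASVs can be lost entirely during subsampling, which is one of the main criticisms of rarefying.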

In this next code chunk our goal is to calculate our diversity metrics and then put them in a table with our metadata.

asv_table_subsample<-subsample_table(asv_table)

diversity_table<-
  data.frame(
    Shannon=diversity(asv_table_subsample, index="shannon", MARGIN=2),
    Obs_ASVs=specnumber(asv_table_subsample, MARGIN=2)
  ) %>%
  rownames_to_column("SampleID") %>%
  full_join(metadata) %>%
  select(SampleID, Group, MouseID, Time_days, Shannon, Obs_ASVs)

Now that we have our table made, we can start making our plot.

diversity_table %>%
  pivot_longer(c(Shannon, Obs_ASVs), names_to="Metric", values_to="Diversity") %>%
  ggplot(aes(x=Time_days, y=Diversity, color=Group)) +
  stat_summary(geom="line") +
  stat_summary(geom="errorbar", width=0.5, fun.data=mean_sd) +
  geom_jitter(width=1, height=0) +
  facet_wrap(~Metric, scales="free") +
  theme_q2r() +
  xlab("Days Post Colonization") +
  ylab("Alpha Diversity") +
  scale_color_manual(values=c("cornflowerblue","indianred"))

ggsave("figures/diversity.pdf", height=3, width=5)

Question: What trends do you see in the data?

Now we can do a statistical test accounting for the nature of the data. Let’s start by asking if Shannon diversity is significantly different between groups 1 day after colonization.

diversity_table %>%
  filter(Time_days==1) %>%
  t.test(Shannon~Group, data=.)
## 
##  Welch Two Sample t-test
## 
## data:  Shannon by Group
## t = -0.5958, df = 3.9292, p-value = 0.5839
## alternative hypothesis: true difference in means between group pre-diet and group post-diet is not equal to 0
## 95 percent confidence interval:
##  -0.3515645  0.2280652
## sample estimates:
##  mean in group pre-diet mean in group post-diet 
##                2.840284                2.902034

Now what if, instead, we wanted to ask about the change over time? We can use a mixed model to account for repeated measures… but what distribution is the count of ASVs most likely to resemble?

fit<-glmer.nb(Obs_ASVs~Group*Time_days+(1|MouseID), data=diversity_table)
summary(fit)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: Negative Binomial(175.6883)  ( log )
## Formula: Obs_ASVs ~ Group * Time_days + (1 | MouseID)
##    Data: diversity_table
## 
##      AIC      BIC   logLik deviance df.resid 
##    186.5    193.6    -87.3    174.5       18 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.78910 -0.58126 -0.04438  0.58822  2.00273 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  MouseID (Intercept) 0        0       
## Number of obs: 24, groups:  MouseID, 6
## 
## Fixed effects:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               3.990630   0.076686  52.039   <2e-16 ***
## Grouppost-diet           -0.061274   0.109913  -0.557   0.5772    
## Time_days                 0.017814   0.005611   3.175   0.0015 ** 
## Grouppost-diet:Time_days -0.004863   0.008076  -0.602   0.5470    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Grpps- Tm_dys
## Grouppst-dt -0.698              
## Time_days   -0.839  0.585       
## Grppst-d:T_  0.583 -0.836 -0.695
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

Question: What trends are significant?

Question: What conclusions do you draw from this diversity analysis? Do you think this explains our weight loss phenotype?

Question: Did you get a message about a singular fit? Should it be ignored?


6 Beta Diversity

Beta diversity refers to measuring the similarity between samples; these measures are often called distances or dissimilarities. Conceptually it is easy to measure the distance between two physical points, but how do we measure the distance between two compositions? There is an entire universe of metrics for this purpose, but today we will use a newer metric called Aitchison distance, also called CLR Euclidean. This metric comes from the field of compositional data analysis (CoDA) and addresses a number of shortcomings of more traditional metrics that were developed for macro-ecology. Read more here.

clr_euc<-asv_table %>% make_clr() %>% t() %>% vegdist(method="euclidean")

Now we can look at our distances:

as.matrix(clr_euc) %>% corner()
##           G11_d1  G11_d14  G11_d21   G11_d7   G12_d1
## G11_d1   0.00000 63.36958 63.84686 62.11805 25.94286
## G11_d14 63.36958  0.00000 28.77995 32.65678 66.31680
## G11_d21 63.84686 28.77995  0.00000 35.05762 67.37840
## G11_d7  62.11805 32.65678 35.05762  0.00000 65.37693
## G12_d1  25.94286 66.31680 67.37840 65.37693  0.00000
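As an aside, the CLR (centered log-ratio) transform that underlies this metric replaces each count with its log ratio to the sample’s geometric mean. A minimal base-R sketch using hypothetical toy counts (a pseudocount is one common way to handle zeros; check what your implementation of make_clr() actually does):

```r
# Hypothetical toy counts for one sample (3 ASVs); the pseudocount
# avoids taking log(0) for ASVs that were not observed
counts <- c(ASV1 = 10, ASV2 = 0, ASV3 = 90) + 0.5

# CLR: log of each value minus the mean of the logs, i.e. the log
# ratio to the geometric mean; values are centered around zero
clr <- log(counts) - mean(log(counts))
round(clr, 2)
```

The Aitchison distance between two samples is then simply the Euclidean distance between their CLR-transformed vectors, which is what the vegdist(method="euclidean") call above computes.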

We have calculated our distances; however, we now need a human-readable way to interpret them. We need a method for reducing the number of dimensions to something that can be plotted in 2 or 3 dimensions. There are a number of methods you may be familiar with, including PCA, PCoA, t-SNE, and NMDS. Today we will use PCoA, which is fairly versatile and works with a number of microbiome-relevant distance metrics.

pc<-pcoa(clr_euc)
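If you are curious what pcoa() is doing: on a distance matrix, PCoA is classical multidimensional scaling, which base R also provides as cmdscale(). A minimal sketch on a made-up matrix (purely illustrative data):

```r
# Hypothetical matrix of 4 samples x 3 features
mat <- matrix(c(1, 2, 3,
                1, 2, 4,
                8, 9, 1,
                8, 8, 2), nrow = 4, byrow = TRUE)

# Classical MDS on the Euclidean distances: returns each sample's
# coordinates on the first k principal coordinate axes
coords <- cmdscale(dist(mat), k = 2)
dim(coords)  # 4 samples x 2 axes
```

Samples 1–2 and samples 3–4 will land near each other on the first axis, mirroring their similarity in the original feature space.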

We can now visualize the plot as below:

pc$vectors %>%
  as.data.frame() %>%
  rownames_to_column("SampleID") %>%
  left_join(metadata) %>%
  arrange(Time_days) %>%
  ggplot(aes(x=Axis.1, y=Axis.2, group=MouseID, color=Group, label=Time_days)) +
  geom_path(arrow=arrow(type="closed", length=unit(0.1,"cm"))) +
  #geom_text() +
  theme_q2r() +
  scale_color_manual(values=c("cornflowerblue","indianred")) +
  xlab(paste("PC1:",round(pc$values$Relative_eig[1]*100,2),"%")) +
  ylab(paste("PC2:",round(pc$values$Relative_eig[2]*100,2),"%"))

ggsave("figures/pcoa.pdf", height=2, width=3)  

We can also do a statistical test for support using Adonis/Permanova:

adonis2(clr_euc~Group*Time_days, data=metadata , strata = metadata$MouseID, permutations = 9999)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Blocks:  strata 
## Permutation: free
## Number of permutations: 9999
## 
## adonis2(formula = clr_euc ~ Group * Time_days, data = metadata, permutations = 9999, strata = metadata$MouseID)
##                 Df SumOfSqs      R2      F Pr(>F)    
## Group            1     7293 0.20314 7.5080 0.0002 ***
## Time_days        1     7066 0.19681 7.2741 0.0002 ***
## Group:Time_days  1     2115 0.05891 2.1774 0.0771 .  
## Residual        20    19427 0.54113                  
## Total           23    35901 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question: What trend do you see in this data? What would you conclude about the similarity between diet groups over time?

Question: What happens if we use a different distance metric?


7 Taxonomic Analysis

So we have clearly seen that pre- and post-diet mice have different microbiotas… but we don’t actually know what makes them different yet. In this section, we will assign the ASVs to known taxa and create some plots to aid exploratory analysis.

7.1 Assigning Taxonomy

There are multiple choices of reference database for this tool. For today’s purposes, we will use the Ribosomal Database Project (RDP), as it is considerably smaller in size and will be fairly fast. For your own analyses, I would recommend the SILVA database, which is far more inclusive and has up-to-date taxonomy. There are two steps:

taxonomy <- assignTaxonomy(sequences$Sequence, "rdp_train_set_18.fa.gz", multithread=TRUE)
taxonomy <- addSpecies(taxonomy, "rdp_species_assignment_18.fa.gz", allowMultiple = TRUE)

taxonomy<-
  taxonomy %>%
  as.data.frame() %>%
  rownames_to_column("Sequence") %>%
  left_join(sequences) %>%
  select(ASV, everything()) %>%
  select(-Sequence) %>%
  column_to_rownames("ASV")

Question: How many ASVs were assigned to a species?

Question: Why would we allow multiple species assignments?

7.2 Stacked Bar Plot

The stacked bar plot is a staple of microbiome manuscripts. We can easily create one as below:

taxa_sums<-summarize_taxa(asv_table, taxonomy)

taxa_barplot(features = taxa_sums$Genus, metadata = metadata, category = "Group")

ggsave("figures/barplot.pdf", height=3, width=8)

Question: What conclusion can you draw from this plot?

Question: What are the limitations of this plot?

Question: How might we make a better plot to compare specific phyla?

7.3 Heatmap

Depending on the nature of the data, trends can sometimes be spotted in heat maps. Let’s create one.

taxa_heatmap(
              features = taxa_sums$Genus,
              metadata = metadata,
              category = "Group"
            )

ggsave("figures/heatmap.pdf", height=4, width=6)

Question: What conclusion can you draw from this plot?

Question: What are the limitations of this plot?

8 Differential Abundance Testing

Ultimately, what people often care most about is individual features/strains. After all, scientists tend to take a reductionist approach and want to identify novel pathogens or new probiotics in microbiome data. While 16S rRNA gene sequencing is fundamentally a tool designed to look at communities as a whole, many approaches exist for looking at individual features, and there is a wide range of philosophies behind them. Generally speaking, it is a good idea to use multivariate statistics, such as the beta diversity analysis above, before looking for differences in individual taxa.

While there is not necessarily a single correct way to find significant taxa, there is certainly a wrong way: do not take raw sequence counts and run a series of uncorrected t-tests. Remember that the data are a technical sampling (if subsampled) of a technical sampling (which DNA binds to the sequencing flow cell), of a technical sampling (which DNA was amplified by the PCR primers), of a technical sampling (which portion of a sample was extracted for DNA), of a biological sampling (one arbitrary sample collection from an individual). As such, if you were to analyze replicate samples at every step of the process you would end up with slightly different numbers (but hopefully the same trends). Even the same sequencing library sequenced multiple times will yield slightly different results. Some tools are more false-positive prone than others, and some create better visualizations than others. A great preprint contrasting a number of tools was recently published: Microbiome differential abundance methods produce disturbingly different results across 38 datasets.
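To see why uncorrected tests are dangerous, consider a base-R simulation (purely illustrative, not our mouse data) in which there is no true difference between groups for any feature:

```r
set.seed(42)
# Simulate 200 "ASVs" with NO true difference between two groups of 6
raw_p <- replicate(200, t.test(rnorm(6), rnorm(6))$p.value)

# Uncorrected: expect roughly 5% "significant" hits by chance alone
sum(raw_p < 0.05)

# Benjamini-Hochberg FDR correction reins in these false positives
fdr_p <- p.adjust(raw_p, method = "fdr")
sum(fdr_p < 0.05)
```

With 200 null tests, around 10 will pass an uncorrected 0.05 threshold by chance; after FDR correction, few if any survive, which is exactly the behavior we want.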

It is also important to consider the nature of your data set. In our case, we have a repeated sampling design over time, and as such it would not be kosher to treat every sample as an independent unit of measurement. As with the alpha diversity analysis, we will use a linear mixed-effects model, here applied after a centered log-ratio transformation, which helps normalize the data so it is appropriate for this analysis. An alternative approach that is conceptually similar is available in ANCOM-II.

As a first step, we can remove features that are very rare, as they won’t be informative, and every additional test performed reduces our power to detect significant changes once we apply multiple-testing corrections.

asv_table_filtered<-filter_features(asv_table, minsamples = 3, minreads = 10)

Question: Does this seem like an appropriate filter? What happens if you change the filtering criteria?

Now we can generate our linear mixed effects model on a per-feature basis:

results<-
asv_table_filtered %>%
  make_clr() %>%
  as.data.frame() %>%
  rownames_to_column("ASV") %>%
  pivot_longer(!ASV, names_to="SampleID", values_to="CLR") %>%
  left_join(metadata) %>%
  group_by(ASV) %>%
  do(
    lmerTest::lmer(CLR~Group*Time_days+(1|MouseID), data=.) %>%
    summary() %>%
    .$coefficients %>%
    as.data.frame() %>%
    rownames_to_column("Term")
  )

Now we can filter for organisms which are significantly different between pre-diet and post-diet (the group term).

clean_results<-
  results %>%
  filter(Term=="Grouppost-diet") %>%
  ungroup() %>%
  mutate(FDR=p.adjust(`Pr(>|t|)`, method="fdr")) %>%
  mutate(Significant=if_else(FDR<0.1 & abs(Estimate)>1, "Significant","ns")) %>%
  left_join(taxonomy %>% rownames_to_column("ASV"))

interactive_table(clean_results)

Question: How many ASVs are significantly higher in the post-diet group?

Question: What term would we look at if we were interested in the organisms which change over time differentially between groups?

One way to represent this data when exploring the number of significant results is a volcano plot, in which we plot fold change against P-value as below:

clean_results %>%
  ggplot(aes(x=Estimate, y=-log10(`Pr(>|t|)`), color=Significant)) +
  geom_point(alpha=0.5, shape=16) +
  xlab("log2(fold change)") +
  ylab("-log10(P-value)") +
  theme_q2r() +
  scale_color_manual(values=c("grey50","indianred")) +
  theme(legend.position="none")

ggsave("figures/volcanoplot.pdf", height=2, width=2)

Question: How many ASVs are significantly higher in the post-diet group?

Question: Which ASVs do you think are most important?

Question: What would your next analysis or experiment be?

9 Final Challenge

Your final task before class is over is to pick the ASV that you think is most interesting and plot its abundance over time, as in the plot below: